Chinese Word Segmentation at Peking University

نویسندگان

  • Huiming Duan
  • Xiaojing Bai
  • Baobao Chang
  • Shiwen Yu
چکیده

Word segmentation is the first step in Chinese information processing, and the performance of the segmenter, therefore, has a direct and great influence on the processing steps that follow. Different segmenters will give different results when handling issues like word boundary. And we will present in this paper that there is no need for an absolute definition of word boundary for all segmenters, and that different results of segmentation shall be acceptable if they can help to reach a correct syntactic analysis in the end. Keyword: automatic Chinese word segmentation, word segmentation evaluation, corpus, natural language processing

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Two-stage Statistical Word Segmentation System for Chinese

In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and wordformation models. This system was evaluated on Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also give results and discussions on this evaluation.

متن کامل

Towards a Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...

متن کامل

A Maximum Entropy Approach to Chinese Word Segmentation

We participated in the Second International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Based on a maximum entropy approach, our word segmenter achieved the highest F measure for AS, CITYU, ...

متن کامل

Chinese Word Segmentation with Maximum Entropy and N-gram Language Model

This paper presents the Chinese word segmentation systems developed by Speech and Hearing Research Group of National Laboratory on Machine Perception (NLMP) at Peking University, which were evaluated in the third International Chinese Word Segmentation Bakeoff held by SIGHAN. The Chinese character-based maximum entropy model, which switches the word segmentation task to a classification task, i...

متن کامل

An integrated approach for Chinese word segmentation

This paper presents an integrated approach for Chinese word segmentation, which can perform disambiguation and unknown word identification simultaneously on the input. In this work, a hybrid model is used to score known word candidates and unknown word candidates equally by incorporating the modified word-formation models (viz. word-juncture models and wordformation patterns) into word bigram m...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003